§1 Abstract

The object of our research is the development of programming knowledge for the synthesis of
concurrent programs. In this final report we describe techniques for synthesising efficient parallel
structures from high-level specifications of a problem. These structures contain collections of trees
interconnected in various ways. We examine an apparently diverse group of problems and show
that they all have properties in common that allow these syntheses to be performed using only a few
synthesis rules. We also explore some alternative syntheses for some structures. Some of the synthesis
paths use transformation rules designed to produce parallel structures containing multidimensional
lattices. These lattices are then transformed into structures containing trees in some cases. In other
cases the lattice structure is better and is retained. In yet other cases the lattice structure is modified
to make a better lattice structure.


§2 Introduction

The purpose of this report is to explore certain aspects of generating parallel computation structures
from high-level specifications. To study this problem we will use a series of classical problems from
computer science and show how efficient implementations could be produced semiautomatically or
automatically. We are designing TransConS, the TRANSformational CONcurrency Synthesiser,
to reduce to practice the principles we enunciate in this report. TransConS will require a theorem
prover. Earlier versions of the system will use human assistance to replace or supplement a theorem
prover. This assistance takes the form of allowing the system to ask the user to assert a critical
theorem or axiom.


§§2.1 Motivation

The system we describe in this report accepts a high-level description of a problem and produces a
description of a parallel structure down to the topology level. It does not describe the layout of individual
processors on a VLSI chip or wafer, but the structures produced are simple and regular enough
that the problem of producing layouts is tractable.

Parallel  structure  synthesis  can  be  used  in  several  ways: 

►  to set up routing in a general purpose parallel computer such as the Ultracomputer [Schwartz-80]
or the Universal Parallel Computer [GalPauLM]

►  to design custom VLSI

►  to control the configuration phase of Wafer Scale Integration.

In particular, we have studied the feasibility of the following:

►  'classical' array problems, among them polynomial-time dynamic programming

►  systolic array problems such as matrix multiplication, polynomial evaluation, convolution
[KungLd-7«]

►  many graph problems [HMS-8S] [LipVal-81]

►  search problems such as pattern matching, unification (part of inference), and combinatorial space
search.


§§2.2 Outline of TransConS

TransConS is intended to synthesise parallel structures, first semiautomatically and then
automatically. The final result is a synthesis system with an efficiency expert [Katt-78] that
estimates processor and memory cost as well as time. TransConS asks for little manual assistance
(depending on the power of the currently integrated theorem prover).


We have previously explored techniques that produce lattice structures. These intrinsically take
linear time (in the extent of one of the dimensions of the problem) to compute their function. We
now consider the synthesis of tree structures, in which it is reasonable to hope for solutions to
problems in logarithmic time.

In this report we discuss the synthesis of tree structures from structures consisting of chains of
processors carrying data in a bucket brigade manner from one end to the other, and from raw
specifications using a general divide-and-conquer formulation. We also discuss the synthesis of a
certain systolic structure from the chain. Additionally, we describe relationships between parallel
implementations of data redistribution problems, prefix summation problems, classification
problems, and substring matching. We show how these problems are related and we show uniform
techniques for producing good parallel structures for all of these problems.


Figure  1.  Structure  of  the  Synthesis  Process 

The  programming  language  V,  in  which  TransConS,  its  inputs,  and  its  outputs  are  written,  is 
a  wide  spectrum  language  that  can  specify  conceptual  schemata,  specifications,  low-level  programs, 
program  transformation  rules,  and  (with  some  special  extensions)  processor  interconnections.  As 
far  as  V  is  used  in  this  report  it  will  be  self-explanatory.  The  PROCESSORS  statement,  used  to 
specify  multiple  processors  and  interconnections,  will  be  described  in  some  detail  in  the  Appendix. 


§3 Redistributional Problems - Broadcast, Census, Up-and-Down

In this section we consider the class of problems in which data are either available at a central source
and there is an opportunity to operate on these data simultaneously at multiple sites, or
available at diverse sources and must be summarised in a single
place. We also consider problems which combine these features.

We  start  with  a  high-level  specification  for  a  given  problem.  In  the  synthesis  process  we  develop  a 
low  level  specification  which  is  functionally  equivalent,  but  which  exhibits  concurrency  and  which 
describes  the  programs  to  be  run  on  the  various  processors. 

Suppose we have a broadcast problem. A naive solution to this problem is a chain of processors.
See Figure 2. In [King-83] we studied a formal method to obtain such a configuration from the
high-level problem specification.


[Figure: INPUT feeds P1, which passes the data to P2, and so on to Pn]

Figure 2. A Bucket Brigade Chain


An  analogous  technique  can  produce  a  chain  in  the  reverse  direction  (see  Figure  3)  for  computing, 
for  example,  the  sum  over  a  list.  Here  the  desired  value  is  computed  incrementally,  and  the  partial 
sum  is  passed  from  one  processor  to  the  next.  In  each  processor  a  new  contribution  is  added  in. 


[Figure: chain ending at OUTPUT]

Figure 3. A Collection Chain

These structures are no faster than linear in the size of the problem. We explore ways to improve
this.

TransConS can be described in a diagram as follows:


In this paper we do not describe the first step of the above diagram. This was studied in [King-83].
We study the restructuring transformations of the second step.


§§3.1 Broadcasts

In this subsection we discuss the problem of deriving an efficient parallel structure for broadcasting.
There are three alternative paths: a simple version, the bucket brigade chain of Figure 2, can be
improved to a tree; the chain can be improved into a systolic array; and the tree structure can be
synthesised by divide and conquer. The first two methods will be described below, and the third
will be discussed in a later section.


3.1.1  Bottom-Up  Synthesis  of  a  Tree  from  a  Chain 

This first synthesis path restructures a bucket brigade chain into a functionally equivalent tree. It
does this one level at a time. The arity of the tree is arbitrary, but for clarity we will describe the
process of synthesising binary trees.

Balanced trees can be built from chains in the following manner:

►  Introduce a chain of new processors half the length of the old.

►  Forge a connection from each element i of the new chain to (a): elements 2i and 2i + 1 or
(b): element 2i of the old.

►  Unless they are needed for another purpose, (a): cut all the links of the old chain or (b): cut the links
between elements 2i + 1 and 2i + 2 of the old chain.

►  The first element of the old chain received the information to send down the chain over a link.
Cut that link and forge a link from its source to the first node of the new chain.

►  Iteratively apply this transformation to the new chain until it consists of a single node.

The formal transformation rule contains parts which guarantee applicability and assert the existence
of certain functions between old and new chain indices. This is discussed in detail in the Appendix.
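
The level-by-level construction of variant (a) can be sketched in Python. This is only an illustration, not the formal V transformation rule: the node naming `(level, index)` and the helper names `chain_to_tree` and `depth` are assumptions introduced here, and odd chain lengths are handled by letting the last new node adopt a single child.

```python
def chain_to_tree(n):
    """Turn a chain of n processors into a tree, one level at a time.

    Nodes are named (level, index); level 0 is the original chain.
    Returns (parent, root) where parent maps each node to the node above it.
    """
    parent = {}
    level, size = 0, n
    while size > 1:
        new_size = (size + 1) // 2            # new chain, half the length of the old
        for i in range(new_size):
            for child in (2 * i, 2 * i + 1):  # variant (a): link to elements 2i and 2i+1
                if child < size:
                    parent[(level, child)] = (level + 1, i)
        level, size = level + 1, new_size     # iterate on the new chain
    return parent, (level, 0)


def depth(parent, node):
    """Number of links between node and the root."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d
```

For a chain of 8 processors the sketch yields a root at level 3, and every leaf is 3 links from the root instead of up to 7 links from the head of the chain.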

The above outline describes two variants of the transformation, a and b. In variant a, all of
the links of the original chain are cut, and we achieve a balanced binary tree (see Figure 4). In
variant b, only one of the links from the new chain to the old chain is forged, and the
link from the linked-to old chain element to its successor is not cut (see Figure 5). In this case we
achieve a slightly unbalanced tree, but the depth of the tree is the same. The entire structure takes
25% fewer nodes than the one of Figure 4. To pay for this, nodes such as Pi must be able to relay
messages as well as compute; formerly they would only have needed to compute.

It is evident that the new parallel structures are functionally equivalent to the old, and that they
are faster. Any node reachable from the broadcast source in the old parallel structure is reachable
in the new one, and the path length is asymptotically half the length. There is no consideration of a
bottleneck here, because we are assuming that the same data is received by all of the leaf processors,
so all of the internal nodes receive a single value (or the same set of values) and can duplicate that
value for their children.


Figure 4. Step 1 of a Transformation Into a Balanced Tree

Figure 5. Step 1 of a Transformation Into an Unbalanced Tree

The transformation of Figure 5 leaves corner nodes (marked o) buried at all levels of the tree, not
merely just above the leaves. These nodes eventually have in-degree = 1 and out-degree = 1. They
can be easily removed, resulting in the 25% saving mentioned above.
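
The node count behind the 25% figure can be checked with a small simulation. The sketch below is an assumption-laden reading of variant (b), not the authors' rule: it assumes the chain length is a power of two, names nodes `(level, index)`, and uses the hypothetical helpers `build_variant_b` and `relay_only_nodes`. Every node has exactly one incoming link (the top node hears the source), so a relay-only node is an internal node with a single outgoing link.

```python
def build_variant_b(n):
    """Build the variant (b) structure for a chain of n = 2**k processors.

    Returns (succ, root), where succ maps each node (level, index) to the
    list of nodes it sends to. Level 0 is the original chain.
    """
    succ = {(0, i): [] for i in range(n)}
    level, size = 0, n
    while size > 1:
        for i in range(size // 2):
            # forge one link from new element i down to old element 2i
            succ[(level + 1, i)] = [(level, 2 * i)]
            # keep the old link 2i -> 2i+1; the link 2i+1 -> 2i+2 is cut
            succ[(level, 2 * i)].append((level, 2 * i + 1))
        level, size = level + 1, size // 2
    return succ, (level, 0)


def relay_only_nodes(succ):
    """Internal nodes (level >= 1) that merely pass one message along."""
    return [node for node, out in succ.items()
            if node[0] >= 1 and len(out) == 1]
```

Under these assumptions a chain of 1024 processors gives 2047 nodes, of which 512 are relay-only, so removing them saves about a quarter of the nodes.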


3.1.2 Systolic Structure Synthesis

We now study distributional problems preferably implemented by a systolic array. For a specific
example, suppose we are evaluating

\[
\begin{array}{ll}
\forall\, i \in \langle 1, \ldots, l \rangle & \text{; must be enumerated in order} \\
\quad \forall\, j \in \{1, \ldots, n\} & \text{; may be enumerated in any order} \\
\qquad B_j \leftarrow B_j + A_i &
\end{array}
\]

and suppose that l and n are of the same order of magnitude, or that all of the B-values are in a
single place and we choose not to distribute them. In these cases a systolic structure is preferable to a tree.

Consider the structure below:

[Figure: (feed in l A-values)]

Figure 6. Simple Parallel Structure for Broadcasting

in which each of the l A-values is added to each of the n B-values.

We explicate the l partial sums, using virtualisation. This creates a separate processor for each step
in the summation process for each of the B-values.

This parallel structure is better than the one of Figure 6 because it does not impose strenuous
requirements on the I/O capabilities of the system. In Figure 6 it is not shown how B-values get
to the various B_j. If this were shown we would see that the assumption was made either that the
broadcast problem was embedded in a larger problem that allows the data to already be there or
that all n B-processors HEAR the I/O processor. The systolic array shown above allows the I/O
processors to be connected to only a single processor.

A formal presentation of these techniques, called virtualisation and aggregation, can be found in
[King-83].
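
The data flow of such a linear systolic array can be simulated in a few lines of Python. This is a sketch under stated assumptions, not the synthesised V program: it assumes one A-value enters the first cell per time step, each value moves one cell along the chain per step, and the combining operation is addition. The name `systolic_add` is hypothetical.

```python
def systolic_add(a_values, b_values):
    """Simulate a linear systolic array that adds every A-value to every B-value.

    Cell j holds B_j. At time step t it sees the A-value with index t - j,
    i.e. the value that entered the first cell j steps earlier.
    """
    l, n = len(a_values), len(b_values)
    b = list(b_values)
    for t in range(l + n - 1):      # steps until the last A-value passes the last cell
        for j in range(n):          # all cells act in parallel within one step
            i = t - j               # index of the A-value currently in cell j
            if 0 <= i < l:
                b[j] += a_values[i]
    return b
```

Only the first cell is connected to the I/O processor, and each cell sees the A-values in their original order, which matches the ordering constraint on the outer enumeration.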


§§3.2 Census Functions

Trees perform broadcast operations well. They can also be used to compute a class of functions
called Census Functions (see [LipVal-81]). Examples of census functions include \(\sum\), min, and
'find the largest subarray of the array that contains a single value'. These have in common that
they are functions on a string \(a_1, a_2, \ldots, a_n\) which can be grouped in an arbitrary manner.

It is possible to perform a census function by using a collection chain (shown in Figure 3). It is
always possible to replace such a structure by a tree structure which replaces the task of taking
the census operation on a string of length n by the task of taking the census operation on a string
of 'sums' whose length is n/2. The structure of the transformation is the same as the one for
broadcasts, making adjustments for the fact that the data have to flow upwards in the tree, and
that if a segment of the original chain is allowed to remain then the link tying it to the higher
level chain has to be attached to the end of the segment. TransConS contains transformations
for census functions as well as broadcasts, but in the interests of brevity these are not presented.
The correctness of these transformations is proved in a similar manner to that of the broadcast
transformations.
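
The halving step described above, a string of length n replaced by a string of 'sums' of length n/2, can be sketched as follows. The function name `census` is hypothetical, the operation is assumed to be associative, and each pass of the loop stands for one level of the tree whose combinations would all run in parallel.

```python
from operator import add


def census(values, combine):
    """Reduce a non-empty sequence with an associative operation.

    Returns (result, rounds), where rounds is the number of tree levels used.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # combine neighbouring pairs, preserving left-to-right order
        nxt = [combine(level[k], level[k + 1])
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an unpaired last element moves up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds
```

For example, `census([3, 1, 4, 1, 5], min)` and `census(range(1, 9), add)` each finish in three rounds, where a collection chain would need a number of steps linear in the length of the string.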


§4 User-Assisted Aggregation

In the previous section we presented methods for generating parallel structures with a tree of
communication links. In this section we use the results of that section as a building block as
we develop techniques to solve problems such as the normalisation of a set of numbers, standard
deviation, and marking the largest interval of elements of a vector that satisfy a given predicate.
These problems have in common that information must travel both from a central point to the
elements of an array, and vice versa.

The naive method of combining two applications of the techniques of the previous section would
yield a pair of trees, one with upward communication links and one with downward ones. This uses
50% more nodes than the problem really requires, and it would be desirable to merge these trees
in the obvious manner. This involves a new form of aggregation, different from that used to merge
diagonal collections of processors in the previous subsection. There, the nodes that were combined
had conceptually very similar roles. Here, this will not be the case. We therefore need to merge
processors in more general ways than previously.


In examples such as normalisation the aggregation is fairly simple to find. In more general cases such
as those that arise from graph problems, however, this would not be so easy. Finding aggregations
across 'family' boundaries can be difficult because of the large number of possibilities involved.
Human intervention is necessary here; automation is difficult for the following reasons:

►  The  bounds  of  arrays  are  often  arbitrary. 

►  There are many aggregations available; it is not clear which are useful.

►  It  is  not  uncommon  for  two  logical  data  to  share  parts  of  an  array. 

►  It  is  possible  for  one  array  to  match  the  combination  of  two  others. 

For these reasons, TransConS understands the AGGREGATE statement, of the form:

AGGREGATE Name_bound iters = Pset_bound
    HAS Aset_bound
        (HEARS iters2_bound
            (USES Aset2_bound ... ))
    HAS ...

This statement is a parameterized statement. Here iters is a predicate defining permissible bindings
of the variables in the list bound. The statement means that the processors in the set Pset_bound (which is a
set-valued expression) are aggregated (i.e. identified to form a single processor named Name_bound)
for each binding of the variables in bound that is permitted by the predicates in iters. It is explicitly
permitted that the set-valued expression can include enumerated elements and explicit setformers.

The  HAS,  HEARS  and  USES  elements  are  analogous  to  those  of  a  PROCESSORS  statement  (see 
Appendix),  although  in  an  AGGREGATE  statement  there  can  be  more  than  one  HAS  clause. 
HEARS  clauses  are  associated  with  specific  HAS  clauses. 

When the user provides such assistance, searching a potentially enormous set of possible interfamily
aggregations is avoided. TransConS's abilities are used to check the validity of the user-proposed
aggregation.

The  following  consistency  checking  is  performed  on  AGGREGATE  declarations: 

(i) formally specified conditions:

►  Pset is disjoint for all distinct settings of bound and for all settings of the respective bounds for
two AGGREGATE statements

►  Name(⟨specific bound⟩) HAS ⟨array element⟩ iff \(\exists P\ (P \in \mathit{Pset}(\langle\text{specific bound}\rangle))\) that HAS
⟨array element⟩.

►  if Pname2 = Name, then

\[
\forall\, \mathit{bound} \text{ s.t. } \mathit{iters},\ \mathit{bound}_2 = F(\mathit{bound}),\ \mathit{bound} \neq \mathit{bound}_2:\quad
\exists\, P_1 \in \mathit{Pset}_{\mathit{bound}},\ P_2 \in \mathit{Pset}_{\mathit{bound}_2}:\ P_1 \text{ HEARS } P_2\ (\text{USES } A)
\]

(meaning that the HEARS clauses of the AGGREGATE statement are those induced by the
underlying processors);

(ii)  informally  specified  conditions: 

►  The  order  of  total  amount  of  computation  done  by  processors  underlying  a  given  node  does  not 
exceed  the  length  of  the  longest  chain. 

►  It  is  not  true  that  A  HEARS  B  and  B  HEARS  A  for  the  same  USES  datum.  (But  violation  of 
this  condition  is  likely  to  imply  violation  of  others) 

A simple example of the utility of the AGGREGATE statement is the aggregation of
corresponding nodes in two balanced binary trees: one resulting from the identification of a broadcast
problem and one from the identification of a census function on the same data.


§5 Parallel Prefix Computation

In this section we will use the divide and conquer scheme to derive a method for performing a
parallel prefix computation. To be concrete we use summation in what follows, although the methods
described are general.


§§5.1 Introduction to the Problem

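Consider the problem of forming the vector of partial sums of the form

\[ \forall\, i,\ 0 \le i \le n-1:\quad v'_i \leftarrow \sum_{j=0}^{i} v_j \]

(summation in place). The naive solution to this problem is (declarations omitted)

\[
\begin{array}{l}
\forall\, i \in \{0, \ldots, n-1\}\ \mathbf{do} \\
\quad v'_i \leftarrow \sum_{j \in \{0, \ldots, i\}} v_j \\
\mathbf{od}
\end{array}
\]

A slightly less naive version is

\[
\begin{array}{l}
\mathit{total} \leftarrow 0 \\
\forall\, i \in \langle 0, \ldots, n-1 \rangle\ \mathbf{do} \\
\quad v'_i \leftarrow \mathit{total} \leftarrow \mathit{total} + v_i \\
\mathbf{od}
\end{array}
\]

which does allow taking advantage of the cumulative nature of the calculation (i.e. that
\(v'_i = v'_{i-1} + v_i\)). Probably the best route for deriving the latter from the former is formal
differentiation [Paige-79]. The induction variables are \(v'\) and a new variable which also receives the
value of \(v'_i\) in the second line of the first fragment.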

We build a structure that binds the vector together using a balanced binary tree. Each of the leaf
nodes starts by sending its value to its parent. Each internal node accepts a value from its left son
and sends that value to its right son. It also adds the values from its two sons and sends the sum to its
parent. All internal nodes, when they receive a value from their parent, send it to their two children.
All leaf nodes add any values received from their parent into their contents. See Figure 7. This
structure is best built by a divide-and-conquer scheme.


Figure 7. Internal Structure of a Prefix Computation Network

Our model should be distinguished from the 'standard' one for parallel prefix circuits (see for
example, [Fich-83] and [LadFish-80]) because we permit nodes to be reused and because we require
the i-th element of the answer to be developed in the same place as the i-th element of the input
vector. In the cited previous work the circuit was a combinatorial network without memory, i.e. it
was required to be a directed acyclic graph and no node would be reused.
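
The message protocol of the prefix network of Figure 7 can be imitated by a sequential Python sketch. The function name `tree_prefix` and the representation of a tree node as the index range `[lo, hi)` it covers are assumptions made for illustration; in the network itself the two subtrees of a node work concurrently, and the inner loop below stands for a broadcast down the right subtree.

```python
def tree_prefix(values):
    """Prefix sums developed in place, following the tree protocol.

    Each internal node covers values[lo:hi]. It receives the sums of its
    two sons, passes the left son's sum down to the right son (which
    relays it to every leaf below), and reports the total to its parent.
    """
    out = list(values)

    def visit(lo, hi):
        """Return the sum of the original values[lo:hi]."""
        if hi - lo == 1:
            return values[lo]                 # a leaf sends its value upward
        mid = (lo + hi) // 2
        left_sum = visit(lo, mid)
        right_sum = visit(mid, hi)
        for k in range(mid, hi):              # broadcast inside the right subtree
            out[k] += left_sum                # leaves add what they receive
        return left_sum + right_sum           # sum sent to the parent

    if values:
        visit(0, len(values))
    return out
```

The i-th result ends up in the same position as the i-th input, which is the requirement that separates this model from the memoryless circuit model.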


§§5.2 The Divide-and-Conquer Formulation

We will derive the tree architecture using a divide and conquer scheme. In what follows, we will
use divide and conquer twice. We will specify the unary function \(F(\vec{v})\) in terms of the divide and
conquer scheme and a binary infix operator \(s \oplus \vec{v}\) which adds s to each element of the vector \(\vec{v}\).
(The operation is assumed to be addition for concreteness, but the method is general.) We will then
specify \(\oplus\).

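The general binary divide and conquer formulation can be described thus (\(\equiv\) is function definition):

\[
F(\vec{v}) \equiv
\begin{cases}
\vec{v} & \text{if } |\vec{v}| = 1 \\
\mathit{Combine}(F(\vec{v}'), F(\vec{v}'')) & \text{otherwise}
\end{cases}
\qquad \text{where } \vec{v}' \,\|\, \vec{v}'' = \vec{v} \text{ and } |\vec{v}'| = \lfloor |\vec{v}|/2 \rfloor
\]

which is an instance of the standard divide-and-conquer scheme. It remains to specify the function
Combine.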

In prefix summation the Combine operation is concatenation of the two vectors, except that the
sum of the elements of the left vector has to be added to the right vector. The sum of the elements
in the left vector is always the left vector's last element, because the left vector has already been
replaced by its prefix sums.

So Combine must look like this, to update the right half of the concatenated string:

\[ \mathit{Combine}(\vec{v}', \vec{v}'') = \vec{v}' \,\|\, \bigl( v'_{|\vec{v}'|} \oplus \vec{v}'' \bigr) \]

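where \(\oplus\) adds a scalar to each element of a vector. \(\oplus\) is, itself, not an atomic operation; it has to
be specified. We can define it as follows:

\[
s \oplus \vec{v} \equiv
\begin{cases}
\langle s + v_1 \rangle & \text{if } |\vec{v}| = 1 \\
(s \oplus \vec{v}') \,\|\, (s \oplus \vec{v}'') & \text{otherwise}
\end{cases}
\qquad \text{where } \vec{v}' \,\|\, \vec{v}'' = \vec{v} \text{ and } |\vec{v}'| = \lfloor |\vec{v}|/2 \rfloor
\]

which is itself a divide-and-conquer formulation.

This formulation leads naturally to a tree structure. The time requirement is \(2 \lg |\vec{v}|\) communication
times and \(\lg |\vec{v}|\) computation times (additions).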
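
The two divide-and-conquer definitions translate almost directly into Python. This is a sketch only: the names `prefix` and `scalar_add` are hypothetical stand-ins for F and for the scalar-to-vector operator, the split point is assumed to be the floor of half the length, and Python list concatenation plays the role of the concatenation operator.

```python
def scalar_add(s, v):
    """Add the scalar s to each element of the non-empty vector v."""
    if len(v) == 1:
        return [s + v[0]]
    mid = len(v) // 2
    # the two halves are independent and could be processed in parallel
    return scalar_add(s, v[:mid]) + scalar_add(s, v[mid:])


def prefix(v):
    """F: prefix sums of the non-empty vector v by divide and conquer."""
    if len(v) == 1:
        return list(v)
    mid = len(v) // 2
    left, right = prefix(v[:mid]), prefix(v[mid:])
    # Combine: the last element of the left result is the sum of the left half
    return left + scalar_add(left[-1], right)
```

For instance, `prefix([1, 2, 3, 4, 5])` evaluates to `[1, 3, 6, 10, 15]`. The recursion depth of both functions is logarithmic in the length of the vector, which is where the logarithmic time of the tree structure comes from.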


§6  Classification  Problems 

We are exploring some classification problems. Specifically, we want fast concurrent solutions to
two problems in which there is defined an equivalence relationship on elements of a vector and it is
desired to mark a representative of each class induced by the partition implied by the equivalence
relationship.

There are two variations of this problem. In one variation, Ordered Classification, there is a total
ordering on the equivalence classes and any representative of one class can be compared with a
representative of the same or another class quickly. In this case, it is obviously correct to sort the
data and find the intervals. In the other variation, Unordered Classification, no such total ordering
exists or can be efficiently computed. It is therefore necessary to test the equivalence of every element
of the vector directly against every other element. This can still be done swiftly, but it requires
more processors to do these comparisons.

It is easy to see* that a total ordering can be imposed whenever equivalence can be tested, but it
appears that the cost could be immense†.

* The domain of the problem can be Gödel-numbered. In fact, any internal representation scheme of elements
of the domain imposes such a numbering. Order elements of the domain by the smallest (under this
numbering) element equivalent to a given element under integer comparison (or lexicographic comparison
if a variable-length internal representation is being used).

† Suppose the equivalence test costs \(F(l)\), where \(l\) is the length of this representation. Clearly \(F(l) \ge O(l)\).
The minimal representation equivalent to a given item can be found by a search in \(O(F(l)\,2^{l})\). It is not
obvious how to do better.


§§6.1 Ordered Classification

In this problem the desirable approach is to sort the input vector, find adjacent clumps of equivalent
elements, and mark the first element of each clump. There are three significant tasks in this: the
discovery that sorting is desirable, the provision of a sorting parallel structure, and the marking of
first elements of clumps.

There is a parallel structure that sorts in \(O(\log n)\) time using \(O(n \log n)\) processors, but the constants
are high [AJKS-83]. There is another parallel structure that sorts in \(O(\log^2 n)\) time on \(O(n)\) processors
[Batcher-68]. There is a parallel structure that usually sorts in \(O(\log n)\) time on \(O(n)\) processors, but it
has a possibility of failure [ReifVal-83]. (This probability tends to zero rapidly as the constants or
the problem size increase.)

It may be that there is a relation \(\preceq\) such that \(x \preceq y \lor y \preceq x\), that \(x \preceq y \land y \preceq x \Rightarrow x \equiv y\),
and that \(\preceq\) may appear in the resulting parallel structure. If TransConS is presented with such an
instance of Classification, it will explore synthesis paths consisting of a sort followed by a prefix
computation, and it will also explore a parallel structure similar to the one described in the next
subsection. It will try to select the more efficient one, which will be the sort and prefix structure in
exactly those cases where there is an efficiently computable well ordering. There are therefore two
pieces of knowledge that are part of TransConS:

►  the  knowledge  that  a  sort  should  be  considered  if  the  data  are  well  ordered,  and 

►  knowledge  of  several  ways  to  perform  a  sort,  and  the  tradeoffs  involved. 


§§6.2 Unordered Classification

When there is no convenient ordering in a classification problem, TransConS explores two
alternative parallel structures. One is fast but uses many processors; the other is slower but uses fewer
processors.

The  first  structure  is  based  on  Leighton’s  Mesh  of  Trees  [Leighton-81].  In  this  arrangement  a 
rectangular  set  of  processors  exists  and  each  processor  is  responsible  for  comparing  an  element  of 
the  problem  array  with  a  (generally)  different  element  of  the  same  array  and  deciding  which  nodes 
it  knows  to  be  redundant  on  that  basis. 

One of the sets of trees, call it the horizontal set, is responsible for distributing the data properly to
rows of nodes. The data are then propagated to the roots of the vertical set of trees, and then to the
vertical nodes. The i-th element is then in row i and in column i, so element (i, j) of the rectangular
array of processors can determine whether elements i and j are equivalent. This information is
propagated up one of the sets of trees and is used to mark the roots of such a set. The processor
count is \(\Theta(n^2)\) and the time is \(\Theta(\log n)\). Note that this is an example of a combination of the census
and broadcast techniques.

A trick is available to reduce the number of processors to \(\Theta(n^2/\log n)\), clearly the best available
in that time because of the number of comparisons that have to be made. Instead of n columns,
n/log n columns can be provided. Each node is responsible for performing log n comparisons instead
of just one, but this only slows the process by a constant factor.
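
The work done by the mesh can be illustrated with a sequential Python sketch. It is not a model of the Mesh of Trees itself: the function name `mark_representatives` is hypothetical, the rule "an element is redundant if an equivalent element occurs earlier" is one assumed way of choosing the representative, and the row-wise `any` stands for the census performed by one set of trees.

```python
def mark_representatives(items, equivalent):
    """Mark one representative of each equivalence class.

    Cell (i, j) of the n-by-n array decides whether element i is made
    redundant by an earlier, equivalent element j. Row i is then
    summarised (a census) to decide whether element i is marked.
    """
    n = len(items)
    redundant = [[j < i and equivalent(items[i], items[j]) for j in range(n)]
                 for i in range(n)]
    return [not any(row) for row in redundant]
```

With equivalence defined as equality modulo 3, the list `[3, 8, 5, 6, 11]` is marked `[True, True, False, False, False]`: 3 represents the class of 3 and 6, and 8 represents the class of 8, 5 and 11. Only an equivalence test is used, never an ordering.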

The  other  structure  is  slower  but  uses  fewer  processors.  It  can  be  synthesised  by  straightforward 
use  of  aggregation  on  the  larger  structure  (where  the  larger  structure  uses  chains  instead  of  trees). 


§7  String  and  Pattern  Matching 

The last problem we investigate is the string search problem. In this problem, we are trying to
determine the position of the first occurrence of the first argument, a string (called the pattern), in
the second argument, a longer string (called the text). TransConS can handle a string matching
problem using a double application of a broadcast tree synthesis followed by a double application of
a census tree synthesis.


It is clear that if the length of the pattern is l and that of the text is m, then the simple method of
trying every possible position of the pattern within the text will use ml comparisons.

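The base form of the problem is

\[ \mathit{result} \leftarrow \min_{i \in \{0, \ldots, |S|-|S'|\}} \ \forall\, j \in \{1, \ldots, |S'|\}:\ S_{i+j} = S'_j \]

where S is the text and S' is the pattern. This form requires \(O(|S|\,|S'|)\) time. TransConS requires
\(O(|S|\,|S'|)\) processors to solve the problem in \(O(\log(|S|\,|S'|))\) time.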

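Replacing a \(\forall \ldots\) used as a boolean by an \(\bigwedge \ldots\), we get

\[ \mathit{result} \leftarrow \min_{i \in \{0, \ldots, |S|-|S'|\}} \ \bigwedge_{j \in \{1, \ldots, |S'|\}} S_{i+j} = S'_j \]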

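Virtualisation around both the \(\bigwedge \ldots\) and the \(\forall \ldots\) yields the two-dimensional array which we will
call equal.

\[
\begin{array}{l}
\textbf{ARRAY}\ \mathit{equal}_{ij},\quad j \in \{1, \ldots, |S'|\},\ i \in \{j, \ldots, |S|-|S'|+j\} \\
\forall\, j \in \{1, \ldots, |S'|\},\ i \in \{j, \ldots, |S|-|S'|+j\}\ \mathbf{do} \\
\quad \mathit{equal}_{ij} \leftarrow (S_i = S'_j) \\
\mathit{result} \leftarrow \min_{i \in \{0, \ldots, |S|-|S'|\}} \ \bigwedge_{j \in \{1, \ldots, |S'|\}} \mathit{equal}_{i+j,\,j}
\end{array}
\]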

This produces a parallel structure in which chains will be created in two mutually orthogonal directions
corresponding to the two dimensions of equal (to distribute S and S' characters), and along each
45° diagonal (to collect information for the \(\bigwedge \ldots\) operation).

There will be three collections of trees. One will be "horizontal" and a second "vertical" in the
lattice of leaf processors. These sets derive from the distribution problem inherent in this approach
to pattern matching. A third collection of trees is diagonal. Its source is the census problem
represented by the ∧ of the specification. There is another tree connecting the root nodes of the
diagonal trees; it derives from the min census function.
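
The data flow described above can be imitated sequentially. The following Python sketch is only an illustration under stated assumptions: it uses 1-based indices for the equal array, takes the offset i from 0 to |S|-|S'|, and reads each 45° diagonal as the entries equal[i+j, j]; the function name is hypothetical and nothing here models the trees or processors themselves.

```python
def match_via_equal(S, Sp):
    """Sequential imitation of the equal-array formulation (assumed indexing)."""
    n, l = len(S), len(Sp)
    # Distribution step: every lattice cell compares one character of S
    # with one character of S' (1-based indices, i in j..n-l+j).
    equal = {
        (i, j): S[i - 1] == Sp[j - 1]
        for j in range(1, l + 1)
        for i in range(j, n - l + j + 1)
    }
    # Census step 1: AND along each diagonal (one diagonal per offset i).
    hits = [
        i
        for i in range(0, n - l + 1)
        if all(equal[(i + j, j)] for j in range(1, l + 1))
    ]
    # Census step 2: min over the diagonal results.
    return min(hits) if hits else None


first = match_via_equal("abracadabra", "cad")
```

In the parallel structure each of the two census steps would be carried out by trees rather than by the loops used here.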


$8  Conclusions 

In this paper we have explored the problem of communication among a large number of processors
when the nature of the problem is such that either large amounts of data must be summarised in
a single processor or small amounts of data must be distributed among many processors. We have
also explored combinations of these techniques. By design, TransConS is able to combine these
techniques and others to produce efficient parallel structures from high-level specifications.

We have also explored some of the efficiency issues that must be considered. A systolic array can
be the best parallel structure for one of these problems. Advanced versions of TransConS detect
these cases and synthesise that better implementation.

We  conclude  that  the  problem  of  producing  efficient  parallel  structures  for  the  class  of  problems 
discussed  in  this  paper  is  amenable  to  automation  through  the  use  of  a  transformational  system. 
We are now completing the design of and implementing TransConS, a testbed for these techniques
and  (hopefully)  a  practical  result  when  completed. 
Technical Appendix


$A.1  Description of the PROCESSORS Statement

The concurrent V language includes a PROCESSORS statement to specify concurrency.

Any  part  of  the  PROCESSORS  statement  except  the  processors  definition  clause  can  be  made 
conditional (evaluable at "compile time"). The HAS clause describes data that a given processor is
responsible for. The HEARS clause defines "wires" that can be used to receive signals from other
processors.  The  USES  clauses  attached  to  a  given  HEARS  clause  define  data  that  are  expected  to 
arrive  on  the  corresponding  wire.  This  is  useful  to  detect  groups  of  wires  that  allow  a  signal  to 
propagate  from  one  processor  to  another  one  that  is  not  directly  connected  to  it,  and  it  helps  to 
define  the  internal  program  of  that  processor  by  defining  an  allocation  of  space  in  an  internal  table 
associated  with  that  wire. 

The example below is a PROCESSORS statement. It describes a family of processors named P
that comprises half of a square array, cut along a main diagonal. Each processor P_{l,m} in this family
HAS (is responsible for computing) an array element A_{l,m}. The processors along one of the edges
of the original square HEAR (receive a connection from) a processor named Q (which is a family
containing one element) because the lth element of that edge USES (needs the value of) array element
v_l. Similarly, P_{l,m} HEARS P_{l,m-1} because it USES A_{l,k} for any 1 ≤ k ≤ m-1. We call the USES
clause(s) attached to a HEARS clause the motivation(s) for that HEARS clause.

Observe  that  the  processors  as  nodes  and  the  HEARS  clauses  as  edges  constitute  a  graph.  If  a 
processor  HEARS  another  processor  for  a  USE  of  a  value  there  must  be  a  path  from  the  processor 
that  HAS  that  value  to  the  processor  that  USES  it,  and  the  last  edge  of  that  path  must  be  the 
HEARS  clause  attached  to  the  USES  clause. 

PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1 HAS A_{l,m}
    If m=1 then HEARS Q (USES v_l)
    If 2 ≤ m ≤ n then HEARS P_{l,m-1} (USES A_{l,k}, 1 ≤ k ≤ m-1)
    If 2 ≤ m ≤ n then HEARS P_{l+1,m-1} (USES ..., 1 ≤ k ≤ m-1);
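
The graph reading of this statement can be made concrete with a short Python sketch. It is a sketch under assumptions: processors are modelled as tuples, the inequalities are read as non-strict (1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1), and `build_family` and `reaches` are hypothetical helper names that are not part of the V language.

```python
def build_family(n):
    """Map each processor to the set of processors it HEARS."""
    hears = {("Q",): set()}
    for m in range(1, n + 1):
        for l in range(1, n - m + 2):
            if m == 1:
                hears[("P", l, m)] = {("Q",)}
            else:
                hears[("P", l, m)] = {("P", l, m - 1), ("P", l + 1, m - 1)}
    return hears


def reaches(hears, src, dst):
    """True if a value held by src can propagate to dst along HEARS wires."""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(hears[node])  # walk the wires backwards from dst
    return False


family = build_family(4)
```

For n = 4 the family has ten P processors plus Q; `reaches(family, ("P", 1, 1), ("P", 1, 3))` holds, which is the path condition that a USES clause for A_{1,1} in P_{1,3} relies on.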


$A.2  Transformations 

Suppose we start with the following specification of a broadcast problem:



INPUT ARRAY a_j, j ∈ {1, ..., l}

INPUT ARRAY b_k, k ∈ {1, ..., n}

OUTPUT ARRAY c_k, k ∈ {1, ..., n}

ARRAY internal_k, k ∈ {1, ..., n}

∀ k ∈ {1, ..., n} do
    internal_k ← b_k

∀ k ∈ {1, ..., n} do
    ∀ j ∈ ((1, ..., l)) do
        internal_k ← internal_k + a_j

∀ k ∈ {1, ..., n} do
    c_k ← internal_k
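
For orientation, here is a sequential Python reading of this specification. It is a sketch that assumes each internal value starts from its b entry, accumulates every a entry with +, and is then copied to the output; the function name and the use of Python lists are illustrative only.

```python
def broadcast_spec(a, b):
    """Sequential reading of the broadcast specification (assumed semantics)."""
    internal = list(b)                 # internal_k <- b_k
    for k in range(len(b)):
        for j in range(len(a)):        # every a_j must reach every k
            internal[k] = internal[k] + a[j]
    return list(internal)              # c_k <- internal_k


c = broadcast_spec([1, 2, 3], [10, 20])
```

Every output depends on every a entry, which is what makes this a broadcast problem: each a_j has to be delivered to all n recipients.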

To apply section 3 we have to know:

► That we indeed have a chain, i.e. that there is a first processor, a last processor, and a unique path
from the first to the last that includes all of the processors that we claim are in the chain.

► That we have a function, F: processor indices → integers, that linearises the collection of processors
properly. F(i_0)=0 where i_0 is the index of the first processor in the chain, and F(a)=F(b)+1
if processor a directly HEARS processor b in the chain. If there are many coexisting non-overlapping
(parallel) chains, F must produce multiple linearisations, i.e. F: processor indices →
integer × vector, so that F(i_0)=((0, vector)) if i_0 is the index of the first processor in any
chain, F(a)=((i, V)) and F(b)=((j, V)) is true iff processors a and b are in the same chain, and
F(a)=((i+1, V)) and F(b)=((i, V)) is true iff processor a directly HEARS processor b in the
chain.

► F has an inverse. (This allows us to compute the processors given their places in named chains.)

► We can enumerate the chains, i.e. we can write an enumeration that gets all V such that
F^{-1}(((0, V))) exists.

► Given V, it is possible to determine how high i gets such that ((i, V)) is in the range of F for valid
processor ids.
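
The conditions in this list can be checked mechanically on a small example. The Python sketch below is illustrative only: it assumes the convention that a processor which HEARS another sits one step later in the chain, represents F as a dictionary from processor index to (i, V), and uses the hypothetical name `check_linearisation`.

```python
def check_linearisation(hears, F):
    """Validate F against the chain conditions; return (inverse, heights).

    hears maps each processor to the processor it HEARS in its chain,
    or to None for the first processor of a chain.
    """
    inverse = {}
    for proc, place in F.items():
        if place in inverse:
            raise ValueError("F is not one-to-one, so it has no inverse")
        inverse[place] = proc
    for a, b in hears.items():
        i, V = F[a]
        if b is None:
            if i != 0:
                raise ValueError("first processor of a chain must map to 0")
        elif F[b] != (i - 1, V):
            raise ValueError("a HEARS b requires F(a) one step after F(b)")
    heights = {}
    for i, V in inverse:
        heights[V] = max(heights.get(V, 0), i)  # how high i gets for each V
    for V, top in heights.items():
        if any((i, V) not in inverse for i in range(top + 1)):
            raise ValueError("chain has a gap")
    return inverse, heights


# Two parallel chains of three processors each, indexed (row, column).
hears = {(r, c): ((r, c - 1) if c > 0 else None) for r in range(2) for c in range(3)}
F = {(r, c): (c, r) for r in range(2) for c in range(3)}
inverse, heights = check_linearisation(hears, F)
```

The keys of `heights` enumerate the chains and its values give the largest i for each V, which are the two pieces of information the last two items ask for.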

In  terms  of  the  notation  we  use  for  parallel  structures,  this  is  performed  using  the  following  steps: 

► Build a new PROCESSORS statement declaring the family P' mentioned below. It needs to use
the domain finding function described in the last group of items. It HAS the values desired by
the chain.

► Provide this new PROCESSORS statement with a chain connecting the nodes in linear order.
This will certainly be possible; the first part of the processor index exposes the linearity explicitly.

► Cut the chain between n-sized clusters of nodes in the chain, unless there is another need for these
links. This can be accomplished in the following manner:

    ► The HEARS clause of the chain has a USES clause or clauses referring to the values that
    are being passed down the chain. If we need a parallel structure that uses fewer nodes (see
    discussion below), attach a condition to the USES clause(s) for these values that inhibits them for
    processors with index a for which F(a)=((i, V)) and i is a multiple of n. If a straight balanced
    tree is acceptable, eliminate the USES clause(s) completely. It is only reasonable to use the "fewer
    nodes" parallel structure when n=2.

    ► Fabricate a new HEARS clause that hears a processor P'_{i/n,V} for all processors meeting the
    above condition. P' is a new family. Let this HEARS clause have a USES clause that uses the
    value(s) described above.

    ► Attach conditions to the chain's HEARS clause that inhibit HEARing if none of the conditions
    on any of its USES clauses is active. (This will eliminate a wire if the only purpose of the chain
    was its bucket brigade function.)

► The first processor of each chain has to HEAR the same processor that was formerly HEARd by
the first processor of the original chain.

► Repeat as necessary.
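
The effect of repeating these steps can be seen in a small Python sketch. It is a simplification under stated assumptions: it follows only the "straight balanced tree" option (every old chain link is dropped), it models a single chain, it omits the outside processor that feeds the root, and the names `chain_to_tree` and `hops_to_root` are hypothetical.

```python
def chain_to_tree(length, n=2):
    """Repeatedly replace a chain by clusters of n nodes that HEAR a new node.

    Nodes are (level, index); level 0 is the original chain. Returns the
    HEARS map (child -> parent) and the root node.
    """
    hears = {}
    level, width = 0, length
    while width > 1:
        for i in range(width):
            hears[(level, i)] = (level + 1, i // n)  # node i HEARS new node i/n
        level, width = level + 1, (width + n - 1) // n
    return hears, (level, 0)


def hops_to_root(hears, node, root):
    """Number of wires a value crosses between the root and the given node."""
    hops = 0
    while node != root:
        node = hears[node]
        hops += 1
    return hops


hears, root = chain_to_tree(8, n=2)
```

For a chain of eight processors the last one is three wires from the root, where the original bucket brigade needed seven hops from the first processor.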



The implementation of this transformation in TransConS rules is as follows:

(rule Halve-Chain (**) transform
 ••PROGRAM

 ∧ Old-Ps: 'PROCESSORS PP(bounds) vary iters has AA(bounds2) iters2 ;
            If P(bounds) then HEARS PP(F(bounds)) (USES BB(jj));
           '
 ∧ Old-Ps ∈ ••
 ∧ (THEOREM ((F(A)=F(B)) ⇒ A=B))              ; 1-1
 ∧ (THEOREM ((F(A)=B) ⇒ A=F*(B)))             ; has inverse
 ∧ (THEOREM ((G(i,V)=A) ⇒ G(i+1,V)=F*(A)))    ; linearisable
 ∧ (THEOREM ((G(i,V)=G(j,W)) ⇒ i=j ∧ V=W))    ; linearisation is 1-1
 ∧ (THEOREM (P(A) ⇔ ∃V | G(0,V)=A))           ; and starts from 0
 ∧ (THEOREM (P(G(0,V)) ⇒ P(G(1,V))))          ; the chains each have
                                              ; two nodes
 ∧ (THEOREM (A=G(H(A))))                      ; linearisation has inverse
 ∧ var ∈ bounds ⇒ ¬(FREEEV var jj)            ; different processors all
                                              ; want the same info
 ∧ Newnodes ← (GENSYM 'NODE)                  ; we'll want to create the new chain
 ∧ New-Ps: 'PROCESSORS PP(bounds) vary iters has AA(bounds2) iters2;
            If P(bounds) then HEARS Newnodes[[(H(BOUNDS)(1)/2), H(BOUNDS)(2..)]]
                (USES BB(jj));
           '
                                              ; let's cut the links in the old chain and
                                              ; forge links from the new one to the old
                                              ; one's nodes
 ∧ Newer-Ps: 'PROCESSORS Newnodes(nbounds) vary
                iters[bounds\mathrel\G(nbounds)]&ODDP(nbounds(1))
                HAS ∅;
              If nbounds(1) > 0
                then HEARS Newnodes(nbounds(1)-2, nbounds(2:...))
                    (USES BB(jj))'            ; and build the new chain

 Old-Ps ∉ ••
 ∧ New-Ps ∈ ••
 ∧ Newer-Ps ∈ ••
)

The  conditions  on  correctness  are  checked  by  the  THEOREM  assertions. 

The other option discussed in that section, the use of the next-to-the-leaves nodes to pass data and
compute simultaneously, requires different rules. The rules for this change have been omitted for
brevity.


$A.3  Creation  of  a  Systolic  Structure  for  Broadcast 

We can synthesise a systolic parallel structure from the base form of the broadcast specification
by assigning a single processor to each recipient of the broadcast, for each such processor assigning
a column of processors so one is available for each stage of the broadcasting, and then combining
diagonal sets of processors into new, single processors in a manner that will be detailed below. We
start with:

INPUT ARRAY a_j, j ∈ {1, ..., l}
INPUT ARRAY b_k, k ∈ {1, ..., n}

OUTPUT ARRAY c_k, k ∈ {1, ..., n}

ARRAY internal_k, k ∈ {1, ..., n}

∀ k ∈ {1, ..., n} do
    internal_k ← b_k

∀ k ∈ {1, ..., n} do
    ∀ j ∈ ((1, ..., l)) do
        internal_k ← internal_k + a_j;

∀ k ∈ {1, ..., n} do
    c_k ← internal_k

As the virtualisation step, we perform a processor expansion, create a chain along the k axis of C'
using rule AS, and we have:


INPUT ARRAY a_j, j ∈ {1, ..., l}

PROCESSORS PA...

INPUT ARRAY b_k, k ∈ {1, ..., n}

PROCESSORS PB...

OUTPUT ARRAY c_k, k ∈ {1, ..., n}

PROCESSORS PC...

PROCESSORS Pinternal_{k,v}, k ∈ {1, ..., n}, v ∈ {0, ..., l} HAS internal_{k,v}
    If k < n then HEARS Pinternal_{k+1,v} (USES ...)
    If k=n ∧ v < l then HEARS Pinternal_{k,v+1} (USES a_j)
    If k=n ∧ v=l then HEARS PA (USES a_j)
    If v=0 ∧ k > 1 then HEARS Pinternal_{k-1,0} (USES b_k)
    If v=0 ∧ k=1 then HEARS PB (USES b_1)
    If v > 0 then HEARS Pinternal_{k,v-1} (USES internal_{k,v-1})

ARRAY internal_{k,v}, k ∈ {1, ..., n}, v ∈ {0, ..., l}

(Include in Pinternal_{k,0}:)
    internal_{k,0} ← b_k

(If j > 0 include in Pinternal_{k,j}:)
    internal_{k,j} ← internal_{k,j-1} + a_j

(Include in Pinternal_{k,l}:)
    c_k ← internal_{k,l}

We applied certain techniques of [King-83] once more than necessary to achieve the "countercurrent"
effect in which the values flow in opposite directions. This is necessary to make the virtualisation
and aggregation work. It also shows the importance of user guidance.

We then aggregate by identifying Pinternal_{k,v} = Pinternal_{k+1,v-1} where both exist.
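
To see what this identification does to the processor count, the Python sketch below groups the index pairs (k, v) by their sum. It assumes that merged processors are exactly those sharing the same value of k + v, with k in 1..n and v in 0..l; the function name is hypothetical and the sketch says nothing about the wiring of the merged processors.

```python
from collections import defaultdict


def aggregate_diagonals(n, l):
    """Group the (k, v) processors of the expanded array by k + v."""
    merged = defaultdict(list)
    for k in range(1, n + 1):
        for v in range(0, l + 1):
            merged[k + v].append((k, v))  # all members become one processor
    return dict(merged)


groups = aggregate_diagonals(3, 2)
```

With n = 3 and l = 2 the nine processors of the expanded array collapse into five, one for each value of k + v from 1 to n + l.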

We  get 

INPUT ARRAY a_j, j ∈ {1, ..., l}

PROCESSORS PA...

INPUT ARRAY b_k, k ∈ {1, ..., n}

PROCESSORS PB...

OUTPUT ARRAY c_k, k ∈ {1, ..., n}

PROCESSORS PC...

PROCESSORS Pinternal_{k'}, ∃ v ∈ {0, ..., l} | k' = k + v, k ∈ {1, ..., n}
    HAS internal_{k,v}, k ∈ {1, ..., n},
                        v ∈ {0, ..., n},
                        k' = k + v
    If k' < n + l then HEARS Pinternal_{k'+1} (USES a_v, v ∈ {1, ..., l})
    If k' > 1 then HEARS Pinternal_{k'-1} (USES b_k, k ∈ {1, ..., n})
                                          (USES internal_{k,v},
                                                k ∈ {1, ..., n},
                                                v ∈ {0, ..., n},
                                                k' - 1 = k + v)

ARRAY internal_{k,v}, k ∈ {1, ..., n}, v ∈ {0, ..., l}

(Include in Pinternal_{k'} | k' ∈ {1, ..., l}:)